Abstract

Even in a massive corpus such as the Web, a substantial fraction of
extractions appear infrequently. This paper shows how to assess the
correctness of sparse extractions by utilizing unsupervised language
models. The Realm system, which combines HMM-based and n-gram-based
language models, ranks candidate extractions by the likelihood that they
are correct. Our experiments show that Realm reduces extraction error by
39%, on average, when compared with previous work.

Because Realm pre-computes language models based on its corpus and does
not require any hand-tagged seeds, it is far more scalable than approaches
that learn models for each individual relation from hand-tagged data.
Thus, Realm is ideally suited for open information extraction, where the
relations of interest are not specified in advance and their number is
potentially vast.

1  Introduction 

Information Extraction (IE) from text is far from infallible. In response,
researchers have begun to exploit the redundancy in massive corpora such
as the Web in order to assess the veracity of extractions (e.g., (Downey
et al., 2005; Etzioni et al., 2005; Feldman et al., 2006)). In essence,
such methods utilize extraction patterns to generate candidate extractions
(e.g., “Istanbul”) and then assess each candidate by computing
co-occurrence statistics between the extraction and words or phrases
indicative of class membership (e.g., “cities such as”).

However, Zipf’s Law governs the distribution of extractions. Thus, even
the Web has limited redundancy for less prominent instances of relations.
Indeed, 50% of the extractions in the data sets employed by (Downey et
al., 2005) appeared only once. As a result, Downey et al.'s model, and
related methods, had no way of assessing which extraction is more likely
to be correct for fully half of the extractions. This problem is
particularly acute when moving beyond unary relations. We refer to this
challenge as the task of assessing sparse extractions.

This paper introduces the idea that language modeling techniques such as
n-gram statistics (Manning and Schütze, 1999) and HMMs (Rabiner, 1989) can
be used to effectively assess sparse extractions. The paper introduces the
Realm system and highlights its unique properties. Notably, Realm does not
require any hand-tagged seeds, which enables it to scale to Open IE, that
is, extraction where the relations of interest are not specified in
advance and their number is potentially vast (Banko et al., 2007).

Realm is based on two key hypotheses. The first, the KnowItAll hypothesis,
is that extractions that occur more frequently in distinct sentences in
the corpus are more likely to be correct. For example, the hypothesis
suggests that the argument pair (Giuliani, New York) is relatively likely
to be appropriate for the Mayor relation, simply because this pair is
extracted for the Mayor relation relatively frequently. Second, we employ
an instance of the distributional hypothesis (Harris, 1985), which can be
phrased as follows: different instances of the same semantic relation tend
to appear in similar textual contexts. We assess sparse extractions by
comparing the contexts in which they appear to those of more common
extractions. Sparse extractions whose contexts are more similar to those
of common extractions are judged more likely to be correct, based on the
conjunction of the KnowItAll and the distributional hypotheses.

The  contributions  of  the  paper  are  as  follows: 

•  The paper introduces the insight that the subfield of language modeling
provides unsupervised methods that can be leveraged to assess sparse
extractions. These methods are more scalable than previous assessment
techniques, and require no hand tagging whatsoever.

•  The paper introduces an HMM-based technique for checking whether two
arguments are of the proper type for a relation.

•  The paper introduces a relational n-gram model for the purpose of
determining whether a sentence that mentions multiple arguments actually
expresses a particular relationship between them.

•  The paper introduces a novel language-modeling system called Realm that
combines both HMM-based models and relational n-gram models, and shows
that Realm reduces error by an average of 39% over previous methods, when
applied to sparse extraction data.

The remainder of the paper is organized as follows. Section 2 introduces
the IE assessment task and describes the Realm system in detail. Section 3
reports on our experimental results, followed by a discussion of related
work in Section 4. Finally, we conclude with a discussion of scalability
and with directions for future work.

2  IE  Assessment 

This section formalizes the IE assessment task and describes the Realm
system for solving it. An IE assessor takes as input a list of candidate
extractions meant to denote instances of a relation, and outputs a ranking
of the extractions with the goal that correct extractions rank higher than
incorrect ones. A correct extraction is defined to be a true instance of
the relation mentioned in the input text.


More formally, the list of candidate extractions for a relation $R$ is
denoted as $E_R = \{(a_1, b_1), \ldots, (a_m, b_m)\}$. An extraction
$(a_i, b_i)$ is an ordered pair of strings. The extraction is correct if
and only if the relation $R$ holds between the arguments named by $a_i$
and $b_i$. For example, for $R$ = Headquartered, a pair $(a_i, b_i)$ is
correct iff there exists an organization $a_i$ that is in fact
headquartered in the location $b_i$.¹

$E_R$ is generated by applying an extraction mechanism, typically a set of
extraction “patterns”, to each sentence in a corpus, and recording the
results. Thus, many elements of $E_R$ are identical extractions derived
from different sentences in the corpus.

This task definition is notable for the minimal inputs required: IE
assessment requires neither the relation name nor hand-tagged seed
examples of the relation. Thus, an IE assessor is applicable to Open IE.

2.1  System  Overview 

In this section, we describe the Realm system, which utilizes language
modeling techniques to perform IE Assessment.

Realm takes as input a set of extractions $E_R$ and outputs a ranking of
those extractions. The algorithm Realm follows is outlined in Figure 1.
Realm begins by automatically selecting from $E_R$ a set of bootstrapped
seeds $S_R$ intended to serve as correct examples of the relation $R$.
Realm utilizes the KnowItAll hypothesis, setting $S_R$ equal to the $h$
elements in $E_R$ extracted most frequently from the underlying corpus.
This results in a noisy set of seeds, but the methods that use these seeds
are noise tolerant.
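
The seed-selection step can be illustrated with a short Python sketch. It
is a minimal illustration under stated assumptions, not the authors'
implementation: the function name `select_seeds`, the toy extraction list,
and the choice to represent extraction frequency by repeated pairs are all
hypothetical.

```python
from collections import Counter

def select_seeds(extractions, h):
    """Split extractions into (seeds, non_seeds).

    `extractions` holds one (arg1, arg2) pair per sentence-level extraction,
    so duplicates encode extraction frequency. The h most frequent distinct
    pairs become the bootstrapped seeds; the rest are left to be ranked.
    """
    counts = Counter(extractions)
    # most_common sorts by descending count; ties keep first-seen order.
    seeds = [pair for pair, _ in counts.most_common(h)]
    seed_set = set(seeds)
    non_seeds = [pair for pair in counts if pair not in seed_set]
    return seeds, non_seeds

# Toy (hypothetical) extraction list for a Headquartered-style relation.
toy = [
    ("Intel", "Santa+Clara"), ("Intel", "Santa+Clara"), ("Intel", "Santa+Clara"),
    ("Microsoft", "Redmond"), ("Microsoft", "Redmond"),
    ("Intel", "Seattle"),
]
seeds, non_seeds = select_seeds(toy, h=2)
```

With this toy input the two frequent pairs become seeds and the single
sparse pair is left for assessment.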

Realm then proceeds to rank the remaining (non-seed) extractions by
utilizing two language-modeling components. An n-gram language model is a
probability distribution $P(w_1, \ldots, w_n)$ over consecutive word
sequences of length $n$ in a corpus. Formally, if we assume a seed
$(s_1, s_2)$ is a correct extraction of a relation $R$, the distributional
hypothesis states that the context distribution around the seed
extraction, $P(w_1, \ldots, w_n \mid w_i = s_1, w_j = s_2)$ for
$1 \le i, j \le n$, tends to be “more similar” to
$P(w_1, \ldots, w_n \mid w_i = e_1, w_j = e_2)$ when the extraction
$(e_1, e_2)$ is correct. Naively comparing context distributions is
problematic, however, because the arguments to a relation often appear
separated by several intervening words. In our experiments, we found that
when relation arguments appear together in a sentence, 75% of the time the
arguments are separated by at least three words. This implies that $n$
must be large, and for sparse argument pairs it is not possible to
estimate such a large language model accurately, because the number of
modeling parameters is proportional to the vocabulary size raised to the
$n$th power. To mitigate sparsity, Realm utilizes smaller language models
in its two components as a means of “backing off” from estimating context
distributions explicitly, as described below.

¹For clarity, our discussion focuses on relations between pairs of
arguments. However, the methods we propose can be extended to relations of
any arity.

First, Realm utilizes an HMM to estimate whether each extraction has
arguments of the proper type for the relation. Each relation $R$ has a set
of types for its arguments. For example, the relation AuthorOf(a, b)
requires that its first argument be an author, and that its second be some
kind of written work. Knowing whether extracted arguments are of the
proper type for a relation can be quite informative for assessing
extractions. The challenge is, however, that this type information is not
given to the system, since the relations (and the types of the arguments)
are not known in advance. Realm solves this problem by comparing the
distributions of the seed arguments and extraction arguments. Type
checking mitigates data sparsity by leveraging every occurrence of the
individual extraction arguments in the corpus, rather than only those
cases in which argument pairs occur near each other.

Although argument type checking is invaluable for extraction assessment,
it is not sufficient for extracting relationships between arguments. For
example, an IE system using only type information might determine that
Intel is a corporation and that Seattle is a city, and therefore
erroneously conclude that Headquartered(Intel, Seattle) is correct. Thus,
Realm’s second step is to employ an n-gram-based language model to assess
whether the extracted arguments share the appropriate relation. Again,
this information is not given to the system, so Realm compares the context
distributions of the extractions to those of the seeds. As described in
Section 2.3, Realm employs a relational n-gram language model in order to
accurately compare context distributions when extractions are sparse.

    Realm(Extractions E_R = {e_1, ..., e_m})
      S_R = the h most frequent extractions in E_R
      U_R = E_R - S_R
      TypeRankings(U_R) ← Hmm-T(S_R, U_R)
      RelationRankings(U_R) ← Rel-grams(S_R, U_R)
      return a ranking of E_R with the elements of S_R at the top
        (ranked by frequency), followed by the elements of
        U_R = {u_1, ..., u_{m-h}} ranked in ascending order of
        TypeRanking(u_i) * RelationRanking(u_i).

Figure 1: Pseudocode for Realm at run-time. The language models used by
the Hmm-T and Rel-grams components are constructed in a pre-processing
step.

Realm executes the type checking and relation assessment components
separately; each component takes the seed and non-seed extractions as
arguments and returns a ranking of the non-seeds. Realm then combines the
two components’ assessments into a single ranking. Although several such
combinations are possible, Realm simply ranks the extractions in ascending
order of the product of the ranks assigned by the two components. The
following subsections describe Realm’s two components in detail.
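
The rank-product combination is simple enough to sketch directly. The
Python below is an illustrative sketch only: the function name
`combine_rankings`, the use of 1-based ranks, and the tie-breaking rule
are assumptions, since the text does not specify them.

```python
def combine_rankings(type_ranking, relation_ranking):
    """Merge two rankings (best first) by ascending product of 1-based ranks."""
    type_rank = {e: r for r, e in enumerate(type_ranking, start=1)}
    rel_rank = {e: r for r, e in enumerate(relation_ranking, start=1)}
    # sorted() is stable, so ties keep the order of the type ranking
    # (an arbitrary choice made for this sketch).
    return sorted(type_ranking, key=lambda e: type_rank[e] * rel_rank[e])

# Hypothetical rankings of four non-seed extractions.
type_ranking = ["a", "b", "c", "d"]
relation_ranking = ["c", "a", "d", "b"]
combined = combine_rankings(type_ranking, relation_ranking)
```

Here the rank products are a: 2, b: 8, c: 3 and d: 12, so an extraction
that both components rank highly ends up at the top.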

We identify the proper nouns in our corpus using the Lex method (Downey et
al., 2007). In addition to locating the proper nouns in the corpus, Lex
also concatenates each multi-token proper noun (e.g., Los Angeles)
together into a single token. Both of Realm’s components construct
language models from this tokenized corpus.

2.2  Type  Checking  with  Hmm-T 

In this section, we describe our type-checking component, which takes the
form of a Hidden Markov Model and is referred to as Hmm-T. Hmm-T ranks the
set $U_R$ of non-seed extractions, with a goal of ranking those
extractions with arguments of proper type for $R$ above extractions
containing type errors. Formally, let $U_{R_i}$ denote the set of the
$i$th arguments of the extractions in $U_R$. Let $S_{R_i}$ be defined
similarly for the seed set $S_R$.

[Figure 2 image not reproduced; its observed token sequence is:
Intel , headquartered in Santa+Clara]

Figure 2: Graphical model employed by Hmm-T. Shown is the case in which
$k = 2$. Corpus pre-processing results in the proper noun Santa Clara
being concatenated into a single token.

Our type checking technique exploits the distributional hypothesis: in
this case, the intuition that extraction arguments in $U_{R_i}$ of the
proper type will likely appear in contexts similar to those in which the
seed arguments $S_{R_i}$ appear. In order to identify terms that are
distributionally similar, we train a probabilistic generative Hidden
Markov Model (HMM), which treats each token in the corpus as generated by
a single hidden state variable. Here, the hidden states take integral
values from $\{1, \ldots, T\}$, and each hidden state variable is itself
generated by some number $k$ of previous hidden states.² Formally, the
joint distribution of the corpus, represented as a vector of tokens
$\mathbf{w}$, given a corresponding vector of states $\mathbf{t}$ is:

$$P(\mathbf{w} \mid \mathbf{t}) = \prod_i P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, \ldots, t_{i-k}) \qquad (1)$$

The distributions on the right side of Equation 1 can be learned from a
corpus in an unsupervised manner, such that words which are distributed
similarly in the corpus tend to be generated by similar hidden states
(Rabiner, 1989). The generative model is depicted as a Bayesian network in
Figure 2. The figure also illustrates the one way in which our
implementation is distinct from a standard HMM, namely that proper nouns
are detected a priori and modeled as single tokens (e.g., Santa Clara is
generated by a single hidden state). This allows the type checker to
compare the state distributions of different proper nouns directly, even
when the proper nouns contain differing numbers of words.
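
To make the factorization concrete, the sketch below evaluates the log of
Equation 1 for a single sentence, assuming the equation is the usual
product of an emission term $P(w_i \mid t_i)$ and a $k$-th order
transition term $P(t_i \mid t_{i-1}, \ldots, t_{i-k})$. The function name
`log_joint`, the dictionary layout of the parameters, the `None` padding
at the sentence start, and the toy probabilities are all assumptions made
for illustration; parameter learning is not shown.

```python
import math

def log_joint(tokens, states, emission, transition, k):
    """Sum over i of log P(w_i | t_i) + log P(t_i | t_{i-k}, ..., t_{i-1})."""
    total = 0.0
    for i, (w, t) in enumerate(zip(tokens, states)):
        # History of the k previous states (oldest first), padded with None
        # where the sentence has not started yet.
        history = tuple(states[j] if j >= 0 else None for j in range(i - k, i))
        total += math.log(emission[t][w]) + math.log(transition[history][t])
    return total

# Toy first-order (k = 1) model with made-up probabilities.
emission = {0: {"Intel": 0.5}, 1: {"in": 0.25}}
transition = {(None,): {0: 0.5}, (0,): {1: 0.5}}
lp = log_joint(["Intel", "in"], [0, 1], emission, transition, k=1)
```

For this toy model the product is 0.5 * 0.5 * 0.25 * 0.5 = 0.03125, which
is what `math.exp(lp)` recovers up to floating-point error.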

²Our implementation makes the simplifying assumption that each sentence in
the corpus is generated independently.

To generate a ranking of $U_R$ using the learned HMM parameters, we rank
the arguments $e_i$ according to how similar their state distributions
$P(t \mid e_i)$ are to those of the seed arguments.³ Specifically, we
define a function:

$$f(e) = \sum_{e_i \in e} \mathrm{KL}\left( \frac{\sum_{w' \in S_{R_i}} P(t \mid w')}{|S_{R_i}|},\; P(t \mid e_i) \right) \qquad (2)$$

where KL represents KL divergence, and the outer sum is taken over the
arguments $e_i$ of the extraction $e$. We rank the elements of $U_R$ in
ascending order of $f(e)$.
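
The scoring function can be sketched as follows, assuming $f(e)$ sums,
over argument positions, the KL divergence between the average state
distribution of the seed arguments at that position and the state
distribution $P(t \mid e_i)$ of the candidate argument. The names
`kl_divergence` and `type_score`, the smoothing constant `eps`, and the
three-state toy distributions are hypothetical; in the system described
here the distributions would come from the trained HMM.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same hidden states."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def type_score(extraction, seed_args, state_dist):
    """f(e): lower values mean the arguments look more like the seeds.

    extraction -- tuple of argument strings (e_1, e_2, ...)
    seed_args  -- seed_args[i] lists the seed arguments at position i
    state_dist -- dict mapping a token to its distribution P(t | token)
    """
    total = 0.0
    for i, arg in enumerate(extraction):
        seed_dists = [state_dist[s] for s in seed_args[i]]
        n_states = len(seed_dists[0])
        mean_seed = [sum(d[t] for d in seed_dists) / len(seed_dists)
                     for t in range(n_states)]
        total += kl_divergence(mean_seed, state_dist[arg])
    return total

# Made-up state distributions over T = 3 hidden states.
state_dist = {
    "Intel": [0.8, 0.1, 0.1], "Microsoft": [0.7, 0.2, 0.1],
    "Santa+Clara": [0.1, 0.8, 0.1], "Redmond": [0.1, 0.7, 0.2],
    "Acme": [0.75, 0.15, 0.1], "Pickerington": [0.1, 0.75, 0.15],
    "yesterday": [0.1, 0.1, 0.8],
}
seed_args = [["Intel", "Microsoft"], ["Santa+Clara", "Redmond"]]
good = type_score(("Acme", "Pickerington"), seed_args, state_dist)
bad = type_score(("Acme", "yesterday"), seed_args, state_dist)
```

Sorting candidates by ascending score places the well-typed pair ahead of
the pair whose second argument is not a location.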

Hmm-T has two advantages over a more traditional type checking approach of
simply counting the number of times in the corpus that each extraction
appears in a context in which a seed also appears (cf. Ravichandran et
al., 2005). The first advantage of Hmm-T is efficiency, as the traditional
approach involves a computationally expensive step of retrieving the
potentially large set of contexts in which the extractions and seeds
appear. In our experiments, using Hmm-T instead of a context-based
approach results in a 10-50x reduction in the amount of data that is
retrieved to perform type checking. Secondly, on sparse data Hmm-T has the
potential to improve type checking accuracy. For example, consider
comparing Pickerington, a sparse candidate argument of the type City, to
the seed argument Chicago, for which the following two phrases appear in
the corpus:

(i) “Pickerington, Ohio”

(ii) “Chicago, Illinois”

In these phrases, the textual contexts surrounding Chicago and
Pickerington are not identical, so to the traditional approach these
contexts offer no evidence that Pickerington and Chicago are of the same
type. For a sparse token like Pickerington, this is problematic because
the token may never occur in a context that precisely matches that of a
seed. In contrast, in the HMM, the non-sparse tokens Ohio and Illinois are
likely to have similar state distributions, as they are both the names of
U.S. States. Thus, in the state space employed by the HMM, the contexts in
phrases (i) and (ii) are in fact quite similar, allowing Hmm-T to detect
that Pickerington and Chicago are likely of the same type. Our experiments
quantify the performance improvements that Hmm-T offers over the
traditional approach for type checking sparse data.

³The distribution $P(t \mid e_i)$ for any $e_i$ can be obtained from the
HMM parameters using Bayes Rule.

The time required to learn Hmm-T’s parameters scales proportionally to
$T^{k+1}$ times the corpus size. Thus, for tractability, Hmm-T uses a
relatively small state space of $T = 20$ states and a limited $k$ value of
3. While these settings are sufficient for type checking (e.g.,
determining that Santa Clara is a city), they are too coarse-grained to
assess relations between arguments (e.g., determining that Santa Clara is
the particular city in which Intel is headquartered). We now turn to the
Rel-grams component, which performs the latter task.
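
As a rough illustration of this scaling (simple arithmetic on the stated
settings, not a measured cost), the factor multiplying the corpus size for
$T = 20$ and $k = 3$, and the effect of hypothetically doubling $T$, are:

```latex
\[
T^{k+1} = 20^{3+1} = 160{,}000,
\qquad
40^{3+1} = 2{,}560{,}000 = 16 \times 160{,}000 .
\]
```

Doubling the number of states at $k = 3$ would therefore multiply this
factor by 16, which is consistent with the choice of a small state space.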

2.3  Relation  Assessment  with  Rel-grams 

Realm’s relation assessment component, called Rel-grams, tests whether the
extracted arguments have a desired relationship, but given Realm’s minimal
input it has no a priori information about the relationship. Rel-grams
relies instead on the distributional hypothesis to test each extraction.

As argued in Section 2.1, it is intractable to build an accurate language
model for context distributions surrounding sparse argument pairs. To
overcome this problem, we introduce relational n-gram models. Rather than
simply modeling the context distribution around a given argument, a
relational n-gram model specifies separate context distributions for an
argument conditioned on each of the other arguments with which it appears.
The relational n-gram model allows us to estimate context distributions
for pairs of arguments, even when the arguments do not appear together
within a fixed window of $n$ words. Further, by considering only
consecutive argument pairs, the number of distinct argument pairs in the
model grows at most linearly with the number of sentences in the corpus.
Thus, the relational n-gram model can scale.

Formally, for a pair of arguments $(e_1, e_2)$, a relational n-gram model
estimates the distributions
$P(w_1, \ldots, w_n \mid w_i = e_1, e_1 \leftrightarrow e_2)$ for each
$1 \le i \le n$, where the notation $e_1 \leftrightarrow e_2$ indicates
the event that $e_2$ is the next argument to either the right or the left
of $e_1$ in the corpus.

Rel-grams begins by building a relational n-gram model of the arguments in
the corpus. For notational convenience, we represent the model’s
distributions in terms of “context vectors” for each pair of arguments.
Formally, for a given sentence containing arguments $e_1$ and $e_2$
consecutively, we define a context of the ordered pair $(e_1, e_2)$ to be
any window of $n$ tokens around $e_1$. Let
$C = \{c_1, c_2, \ldots, c_{|C|}\}$ be the set of all contexts of all
argument pairs found in the corpus.⁴ For a pair of arguments $(e_j, e_k)$,
we model their relationship using a $|C|$-dimensional context vector
$v_{(e_j, e_k)}$, whose $i$-th dimension corresponds to the number of
times context $c_i$ occurred with the pair $(e_j, e_k)$ in the corpus.
These context vectors are similar to document vectors from Information
Retrieval (IR), and we leverage IR research to compare them, as described
below.
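
One way to build such context vectors is sketched below. This is a minimal
sketch under stated assumptions: the function name `context_vectors`, the
`<ARG>` placeholder that masks the first argument, and the reading of
"window of n tokens around the argument" as every n-token window covering
it are choices made for illustration, not details given in the text.

```python
from collections import Counter, defaultdict

def context_vectors(sentences, arguments, n=3):
    """Map each ordered pair (e1, e2) of adjacent arguments to context counts.

    A context is an n-token window containing e1, with e1 replaced by the
    placeholder "<ARG>" so that contexts are comparable across pairs.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        arg_positions = [i for i, tok in enumerate(tokens) if tok in arguments]
        # Adjacent arguments in the sentence, taken in both directions.
        neighbours = list(zip(arg_positions, arg_positions[1:]))
        neighbours += [(j, i) for i, j in neighbours]
        for i, j in neighbours:
            masked = tokens[:i] + ["<ARG>"] + tokens[i + 1:]
            # Every window of n tokens that covers position i.
            for start in range(max(0, i - n + 1), min(i, len(tokens) - n) + 1):
                window = tuple(masked[start:start + n])
                vectors[(tokens[i], tokens[j])][window] += 1
    return vectors

sentence = ["Intel", ",", "headquartered", "in", "Santa+Clara"]
vecs = context_vectors([sentence], {"Intel", "Santa+Clara"}, n=3)
```

Because only adjacent arguments are paired, each sentence contributes a
bounded number of pairs, in line with the linear-growth argument made for
the relational n-gram model.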

To assess each extraction, we determine how similar its context vector is
to a canonical seed vector (created by summing the context vectors of the
seeds). While there are many potential methods for determining similarity,
in this work we rank extractions by decreasing values of the BM25 distance
metric. BM25 is a TF-IDF variant introduced in TREC-3 (Robertson et al.,
1992), which outperformed both the standard cosine distance and a smoothed
KL divergence on our data.
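
A BM25-style comparison of context vectors might look like the sketch
below. The text does not give the exact BM25 formulation or parameter
values used, so everything specific here is an assumption: the function
name `bm25_scores`, the conventional defaults `k1 = 1.2` and `b = 0.75`,
the non-negative IDF variant, and the decision to treat the summed seed
vector as an unweighted query over candidate "documents".

```python
import math
from collections import Counter

def bm25_scores(seed_vector, candidate_vectors, k1=1.2, b=0.75):
    """Score each candidate context vector against the seed vector with BM25."""
    n_docs = len(candidate_vectors)
    lengths = {pair: sum(vec.values()) for pair, vec in candidate_vectors.items()}
    avg_len = sum(lengths.values()) / n_docs
    # Document frequency: number of candidates in which each context occurs.
    df = Counter(c for vec in candidate_vectors.values() for c in vec)
    scores = {}
    for pair, vec in candidate_vectors.items():
        score = 0.0
        for context in seed_vector:
            tf = vec.get(context, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - df[context] + 0.5) / (df[context] + 0.5))
            norm = tf + k1 * (1 - b + b * lengths[pair] / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[pair] = score
    return scores

# Hypothetical seed vector and two candidate pairs.
seed_vector = Counter({"headquartered in <ARG>": 5})
candidates = {
    ("Acme", "Pickerington"): Counter({"headquartered in <ARG>": 1}),
    ("Intel", "Seattle"): Counter({"visited <ARG> yesterday": 1}),
}
scores = bm25_scores(seed_vector, candidates)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Candidates are then ranked by decreasing score, so a pair that shares
contexts with the seeds is placed above one that shares none.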

3  Experimental  Results 

This section describes our experiments on IE assessment for sparse data.
We start by describing our experimental methodology, and then present our
results. The first experiment tests the hypothesis that Hmm-T outperforms
an n-gram-based method on the task of type checking. The second experiment
tests the hypothesis that Realm outperforms multiple approaches from
previous work, and also outperforms each of its Hmm-T and Rel-grams
components taken in isolation.

3.1  Experimental  Methodology 

The corpus used for our experiments consisted of a sample of sentences
taken from Web pages. From an initial crawl of nine million Web pages, we
selected sentences containing relations between proper nouns. The
resulting text corpus consisted of about three million sentences, and was
tokenized as described in Section 2. For tractability, before and after
performing tokenization, we replaced each token occurring fewer than five
times in the corpus with one of two “unknown word” markers (one for
capitalized words, and one for uncapitalized words). This preprocessing
resulted in a corpus containing about sixty-five million total tokens, and
214,787 unique tokens.

⁴Pre-computing the set $C$ requires identifying in advance the potential
relation arguments in the corpus. We consider the proper nouns identified
by the Lex method (see Section 2.1) to be the potential arguments.
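
The rare-token replacement used in preparing the corpus can be sketched as
follows. The marker strings `<UNK_CAP>` and `<UNK_LOWER>`, the function
name `replace_rare_tokens`, and the test for capitalization are
hypothetical; the text only states that two markers are used and that the
threshold is five occurrences.

```python
from collections import Counter

def replace_rare_tokens(sentences, min_count=5):
    """Replace tokens seen fewer than min_count times with an unknown marker."""
    counts = Counter(tok for sent in sentences for tok in sent)

    def mark(tok):
        if counts[tok] >= min_count:
            return tok
        # One marker for capitalized tokens, another for everything else.
        return "<UNK_CAP>" if tok[:1].isupper() else "<UNK_LOWER>"

    return [[mark(tok) for tok in sent] for sent in sentences]

# Tiny made-up corpus: "the" occurs five times, the other tokens once.
corpus = [["the", "Zyx", "the"], ["the", "the", "the", "qqq"]]
cleaned = replace_rare_tokens(corpus)
```

Collapsing rare tokens in this way bounds the vocabulary that the language
models must cover.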

We evaluated performance on four relations: Conquered, Founded,
Headquartered, and Merged. These four relations were chosen because they
typically take proper nouns as arguments, and included a large number of
sparse extractions. For each relation $R$, the candidate extraction list
$E_R$ was obtained using TextRunner (Banko et al., 2007). TextRunner is an
IE system that computes an index of all extracted relationships it
recognizes, in the form of (object, predicate, object) triples. For each
of our target relations, we executed a single query to the TextRunner
index for extractions whose predicate contained a phrase indicative of the
relation (e.g., “founded by”, “headquartered in”), and the results formed
our extraction list. For each relation, the 10 most frequent extractions
served as bootstrapped seeds. All of the non-seed extractions were sparse
(no argument pairs were extracted more than twice for a given relation).
These test sets contained a total of 361 extractions.

3.2  Type  Checking  Experiments 

As discussed in Section 2.2, on sparse data Hmm-T has the potential to
outperform type checking methods that rely on textual similarities of
context vectors. To evaluate this claim, we tested the Hmm-T system
against an N-GRAMS type checking method on the task of type-checking the
arguments to a relation. The N-GRAMS method compares the context vectors
of extractions in the same way as the Rel-grams method described in
Section 2.3, but is not relational (N-GRAMS considers the distribution of
each extraction argument independently, similar to Hmm-T). We tagged an
extraction as type correct iff both arguments were valid for the relation,
ignoring whether the relation held between the arguments.

The results of our type checking experiments are shown in Table 1. For all
types, Hmm-T outperforms N-GRAMS, and Hmm-T reduces error (measured in
missing area under the precision/recall curve) by 46%. The performance
difference on each relation is statistically significant (p < 0.01,
two-sampled t-test), using the methodology for measuring the standard
deviation of area under the precision/recall curve given in (Richardson
and Domingos, 2006). N-GRAMS, like Rel-grams, employs the BM25 metric to
measure distributional similarity between extractions and seeds. Replacing
BM25 with cosine distance cuts Hmm-T’s advantage over N-GRAMS, but Hmm-T’s
error rate is still 23% lower on average.

    Type            Hmm-T    N-GRAMS
    Conquered       0.917    0.767
    Founded         0.827    0.636
    Headquartered   0.734    0.589
    Merged          0.920    0.854
    Average         0.849    0.712

Table 1: Type Checking Performance. Listed is area under the
precision/recall curve. Hmm-T outperforms N-GRAMS for all relations, and
reduces the error in terms of missing area under the curve by 46% on
average.

3.3  Experiments  with  Realm 

The Realm system combines the type checking and relation assessment
components to assess extractions. Here, we test the ability of Realm to
improve the ranking of a state-of-the-art IE system, TextRunner. For these
experiments, we evaluate Realm against the TextRunner frequency-based
ordering, a pattern-learning approach, and the Hmm-T and Rel-grams
components taken in isolation. The TextRunner frequency-based ordering
ranks extractions in decreasing order of their extraction frequency, and
importantly, for our task this ordering is essentially equivalent to that
produced by the “Urns” (Downey et al., 2005) and Pointwise Mutual
Information (Etzioni et al., 2005) approaches employed in previous work.

The pattern-learning approach, denoted as Pl, is modeled after Snowball
(Agichtein, 2006). The algorithm and parameter settings for Pl were those
manually tuned for the Headquartered relation in previous work (Agichtein,
2005). A sensitivity analysis of these parameters indicated that the
results are sensitive to the parameter settings. However, we found no
parameter settings that performed significantly better, and many settings
performed significantly worse. As such, we believe our results reasonably
reflect the performance of a pattern learning system on this task. Because
Pl performs relation assessment, we also attempted combining Pl with Hmm-T
in a hybrid method (Pl+Hmm-T) analogous to Realm.

    Method       Conquered     Founded       Headquartered   Merged        Average
    Avg. Prec.   0.698         0.578         0.400           0.742         0.605
    TextRunner   0.738         0.699         0.710           0.784         0.733
    Pl           0.885         0.633         0.651           0.852         0.785
    Pl+Hmm-T     0.883         0.722         0.727           0.900         0.808
    Hmm-T        0.830         0.776         0.678           0.864         0.787
    Rel-grams    0.929 (39%)   0.713         0.758           0.886         0.822
    Realm        0.907 (19%)   0.781 (27%)   0.810 (35%)     0.908 (38%)   0.851 (39%)

Table 2: Performance of Realm for assessment of sparse extractions. Listed
is area under the precision/recall curve for each method. In parentheses
is the percentage reduction in error over the strongest baseline method
(TextRunner or Pl) for each relation. “Avg. Prec.” denotes the fraction of
correct examples in the test set for each relation. Realm outperforms its
Rel-grams and Hmm-T components taken in isolation, as well as the
TextRunner and Pl systems from previous work.

The results of these experiments are shown in Table 2. Realm outperforms
the TextRunner and Pl baselines for all relations, and reduces the missing
area under the curve by an average of 39% relative to the strongest
baseline. The performance differences between Realm and TextRunner are
statistically significant for all relations, as are differences between
Realm and Pl for all relations except Conquered (p < 0.01, two-sampled
t-test). The hybrid Realm system also outperforms each of its components
in isolation.
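
The "reduction in missing area under the curve" can be reproduced from the
reported areas with a one-line formula. The sketch below assumes that
error is defined as one minus the area under the precision/recall curve;
the function name `error_reduction` is hypothetical, and the inputs are
the Conquered and Founded entries of Table 2 (strongest baseline versus
Realm).

```python
def error_reduction(baseline_auc, system_auc):
    """Relative reduction in missing area under the precision/recall curve."""
    baseline_error = 1 - baseline_auc
    system_error = 1 - system_auc
    return (baseline_error - system_error) / baseline_error

# Conquered: strongest baseline is Pl (0.885); Realm scores 0.907.
conquered = error_reduction(0.885, 0.907)
# Founded: strongest baseline is TextRunner (0.699); Realm scores 0.781.
founded = error_reduction(0.699, 0.781)
```

These evaluate to roughly 0.19 and 0.27, matching the percentages shown in
parentheses in Table 2. Because the table reports areas rounded to three
decimals, recomputed percentages for other cells may differ from the
published ones by about a point.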

4  Related  Work 

To  our  knowledge,  Realm  is  the  first  system  to  use 
language  modeling  techniques  for  IE  Assessment. 

Redundancy-based approaches to pattern-based IE assessment (Downey et al.,
2005; Etzioni et al., 2005) require that extractions appear relatively
frequently with a limited set of patterns. In contrast, Realm utilizes all
contexts to build a model of extractions, rather than a limited set of
patterns. Our experiments demonstrate that Realm outperforms these
approaches on sparse data.


Type checking using named-entity taggers has been previously shown to
improve the precision of pattern-based IE systems (Agichtein, 2005;
Feldman et al., 2006), but the Hmm-T type-checking component we develop
differs from this work in important ways. Named-entity taggers are limited
in that they typically recognize only a small set of types (e.g.,
ORGANIZATION, LOCATION, PERSON), and they require hand-tagged training
data for each type. Hmm-T, by contrast, performs type checking for any
type. Finally, Hmm-T does not require hand-tagged training data.

Pattern learning is a common technique for extracting and assessing sparse
data (e.g., (Agichtein, 2005; Riloff and Jones, 1999; Paşca et al.,
2006)). Our experiments demonstrate that Realm outperforms a pattern
learning system closely modeled after (Agichtein, 2005). Realm is inspired
by pattern learning techniques (in particular, both use the distributional
hypothesis to assess sparse data) but is distinct in important ways.
Pattern learning techniques require substantial processing of the corpus
after the relations they assess have been specified. Because of this,
pattern learning systems are unsuited to Open IE. Unlike these techniques,
Realm pre-computes language models which allow it to assess extractions
for arbitrary relations at run-time. In essence, pattern-learning methods
run in time linear in the number of relations, whereas Realm's run time is
constant in the number of relations. Thus, Realm scales readily to large
numbers of relations whereas pattern-learning methods do not.


A second distinction of Realm is that its type checker, unlike the named
entity taggers employed in pattern learning systems (e.g., Snowball), can
be used to identify arbitrary types. A final distinction is that the
language models Realm employs require fewer parameters and heuristics than
pattern learning techniques.

Similar distinctions exist between Realm and a recent system designed to
assess sparse extractions by bootstrapping a classifier for each target
relation (Feldman et al., 2006). As in pattern learning, constructing the
classifiers requires substantial processing after the target relations
have been specified, and a set of hand-tagged examples per relation,
making it unsuitable for Open IE.

5  Conclusions 

This paper demonstrated that unsupervised language models, as embodied in
the Realm system, are an effective means of assessing sparse extractions.

Another attractive feature of Realm is its scalability. Scalability is a
particularly important concern for Open Information Extraction, the task
of extracting large numbers of relations that are not specified in
advance. Because Hmm-T and Rel-grams both pre-compute language models,
Realm can be queried efficiently to perform IE Assessment. Further, the
language models are constructed independently of the target relations,
allowing Realm to perform IE Assessment even when relations are not
specified in advance.

In future work, we plan to develop a probabilistic model of the
information computed by Realm. We also plan to evaluate the use of
non-local context for IE Assessment by integrating document-level modeling
techniques (e.g., Latent Dirichlet Allocation).